<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <meta name="keywords" content="remote machines, Git, GitHub" />
    <meta name="description" content="Lecture introducing remote machines and version control using Git." />
    <title>BCB 546X -- 17 Jan 2017</title>
    <style>
      @import url(https://fonts.googleapis.com/css?family=Droid+Serif);
      @import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
      @import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);

      body {
        font-family: 'Droid Serif';
      }
      h1, h2, h3 {
        font-family: 'Yanone Kaffeesatz';
        font-weight: 400;
        margin-bottom: 0;
      }
      .remark-slide-content h1 { font-size: 3em; }
      .remark-slide-content h2 { font-size: 2em; }
      .remark-slide-content h3 { font-size: 1.6em; }
      .footnote {
        position: absolute;
        bottom: 3em;
      }
      li p { line-height: 1.25em; }
      .red { color: #fa0000; }
      .large { font-size: 2em; }
      a, a > code {
        color: rgb(249, 38, 114);
        text-decoration: none;
      }
      code {
        background: #e7e8e2;
        border-radius: 5px;
      }
      .remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
      .remark-code-line-highlighted     { background-color: #373832; }
      .pull-left {
        float: left;
        width: 47%;
      }
      .pull-right {
        float: right;
        width: 47%;
      }
      .pull-right ~ p {
        clear: both;
      }
      #slideshow .slide .content code {
        font-size: 0.8em;
      }
      #slideshow .slide .content pre code {
        font-size: 0.9em;
        padding: 15px;
      }
      .inverse {
        background: #272822;
        color: #777872;
        text-shadow: 0 0 20px #333;
      }
      .inverse h1, .inverse h2 {
        color: #f3f3f3;
        line-height: 0.8em;
      }
      /* Two-column layout */
      .left-column {
        color: #777;
        width: 20%;
        height: 92%;
        float: left;
      }
        .left-column h2:last-of-type, .left-column h3:last-child {
          color: #000;
        }
      .right-column {
        width: 75%;
        float: right;
        padding-top: 1em;
      }
    </style>
  </head>
  <body>
    <textarea id="source">

class: center, middle

# Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks
## Buffalo Chapter 12

---

## Introduction:

--

* Pipelines involve running a sequence of commands on one or many files

--

* Well-written pipelines are both _**robust**_ and _**reproducible**_

--

* What would be the characteristics of a robust and reproducible pipeline?

--

* Crafting a robust pipeline can be particularly challenging because it involves several commands applied to different files

--

This chapter introduces the following topics related to pipelines:

--

1. Bash shell scripts
	
--
	
2. File processing with `find` and `xargs`
	
--
	
3. Running parallel pipelines
	
--
	
4. Components of a simple makefile

---

## Basic Bash Scripting:

--

* Bash, in addition to being the shell through which we send commands to a computer's kernel, is also a full-fledged scripting language

--

* While Bash is more limited as a general purpose language then something like Python, it's great for "duct taping" commands together into a pipeline

--

* So what is a "Bash script"?

--

	* A series of commands organized into a rerunnable script with built in checks for file availability and errors
	
--

	* Bash scripts always have the extension _.sh_ and you can create them in a simple text editor
	
---

## The components of your Bash header:

--

A typical Bash header will look something like this:

```
#!/bin/bash
set -e
set -u
set -o pipefail
```
--

`#!/bin/bash`: known as the _shebang_ and indicates the path to the interpreter used to execute the script

--

`set -e`: prevents the shell script from proceeding if one of its commands fails

--

`set -u`: aborts a script if an unset variable is encountered

--

`set -o pipefail`: if your script contains pipes, the `set -e` command will only result in script termination if the last command in the pipe results in an error...this indicates any error in a pipe should cause exit

---

## Running a Bash script:

--

A Bash script can be run by:

--

* Using the `bash` program directly:

```
bash script.sh
```

--

* Calling the script as a program:

```
./script.sh
```
--

Calling the script as a program allows flexibility across interpreters (other shells), but executable permissions may first need to be given to the script to run:

```
$ chmod u+x script.sh
```

---

## Passing command-line arguments to your Bash script:

--

Create the following Bash script and save it as `args.sh`:

```
#!/bin/bash
set -e
set -u
set -o pipefail

echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3"
```

--

Now try running it as:

```
$ bash args.sh arg1 arg2 arg3
```

--

How is Bash passing the command line arguments to the script?

---

This can be utilized for creating informative error messages:

--

Modify your `args.sh` script as follows:

```
#!/bin/bash
set -e
set -u
set -o pipefail

if [ "$#" -lt 3 ] # are there less than 3 arguments?
then
        echo "error: too few arguments, you provided $#, 3 required"
        echo "usage: args.sh arg1 arg2 arg3"
        exit 1
fi

echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3"
```

--

Try running with three command-line arguments and fewer than three

---

Now let's create a Bash script that includes an `else` as part of our "if Statement"...

--

Save the following script as `genefind.sh`

```
#!/bin/bash
set -e
set -u
set -o pipefail

if grep "$1" ${2}.txt > /dev/null
then
    echo "found $1 in ${2}.txt"
else
	echo "did not find $1 in ${2}.txt"
fi
```
--

Take a look at the file `genes.txt` in the Week_14 folder and see if you can determine how to use this script

--

Thinking back to our UNIX exercises, can you think of how to modify this to search two files for the same pattern?

---

Here you go:

```
#!/bin/bash
set -e
set -u
set -o pipefail

if grep "$1" ${2}.txt > /dev/null &&
	grep "$1" ${3}.txt > /dev/null
then
    echo "found $1 in ${2}.txt and ${3}.txt"
else
	echo "did not find $1 in ${2}.txt and ${3}.txt"
fi
```
---

## Processing Files with Bash Using "for Loops" and Globbing:

--

We'll often want to run a Bash script across a number of files (_e.g.,_ files with different chromosomes' data)

--

In order to complete this iterative process we'll need to:

1. **Select** which files to apply the commands to

2. **Loop** over the data and apply the commands

3. **Keep track** of the names of any output files created

--

Files can be selected by either using information contained within another "guide" file or selecting files within a directory based on some criteria.

--

Let's look at an example that selects files based on information in a guide file:

---

```
#!/bin/bash
    set -e
    set -u
    set -o pipefail

# specify the input samples file, where the third
# column is the path to each sample FASTQ file
sample_info=samples.txt

# create a Bash array from the third column of $sample_info
sample_files=($(cut -f 3 "$sample_info"))

for fastq_file in ${sample_files[@]}
do
	# strip .fastq from each FASTQ file, and add suffix
	# "-stats.txt" to create an output filename for each FASTQ file
	results_file="$(basename $fastq_file .fastq)-stats.txt"

	# run fastq_stat on a file, writing results to the filename we've 
	# indicated above
	fastq_stat $fastq_file > stats/$results_file
done
```
--

This example takes a single input file, operates on the file, and produces a single output file

--

What if we need to operate across multiple files to produce a single output?

---

Here's an example:

```
#!/bin/bash
set -e
set -u
set -o pipefail
    
# specify the input samples file, where the third
# column is the path to each sample FASTQ file
sample_info=samples.txt

# our reference
reference=zmays_AGPv3.20.fa

# create a Bash array from the first column, which are
# sample names. Because there are duplicate sample names
# (one for each read pair), we call uniq
sample_names=($(cut -f 1 "$sample_info" | uniq))

for sample in ${sample_names[@]}
do
	# create an output file from the sample name
	results_file="${sample}.sam"
	bwa mem $reference ${sample}_R1.fastq ${sample}_R2.fastq \
	> $results_file
done
```
---

## Processing/Matching Files with `find` and `xargs`:

--

`find` can be used in a Bash script to match particular files and then `xargs` can be used to pass these files on to particular commands

--

`find` is different than `ls` in that it can match files recursively:

--

```
$ find zmays-snps
```
--

The depth to which `find` searches within your directory can be controlled with the `-maxdepth` option:

--

```
$ find zmays-snps -maxdepth 1
```

--

We can match particular patterns in files using the `-name` option of `find`:

--

```
$ find data/seqs -name "zmaysB*fastq"
```

--

Other than searching recursively, can you see differences in the results of `find` vs. `ls`?

---

We can also control what we match with `find` using the `type` option:

```
$ find data/seqs -name "zmaysB*fastq" -type f
```

--

How are these options of `find` being logically interpreted?

--

We can modify this:

--

```
$ find data/seqs -name "zmaysB*fastq" -or -type f
```

--

Negation can also be useful for matching files:

```
$ find data/seqs -type f

$ find data/seqs -type f -not -name "zmaysC*fastq"

$ find data/seqs -type f -not -name "zmaysC*fastq" -not -name "*-temp*"
```
---

Once we've matched a file, we can pass it along to a program using `xargs`

--

Let's create some temporary files that we'll then want to remove to see how this works:

```
$ touch data/seqs/zmays{A,C}_R{1,2}-temp.fastq
```
--

we can match them using `find`:

```
find . -name "*-temp.fastq"
```
--

But can you pipe this directly to `rm`?

--

Here's where `xargs` comes in handy:

```
find . -name "*-temp.fastq" | xargs rm
```

--

And finally with `xargs` you can launch multiple processes in parallel:

--

```
find . -name "fastq" -type f | xargs -P 6 fastq_stat
```


    </textarea>
    <script src="https://gnab.github.io/remark/downloads/remark-latest.min.js" type="text/javascript">
    </script>
    <script>
      var hljs = remark.highlighter.engine;
    </script>
    <script src="remark.language.js"></script>
    <script>
      var slideshow = remark.create({
          highlightStyle: 'monokai',
          highlightLanguage: 'tex',
          highlightLines: true
        }) ;
    </script>
    <script>
      var _gaq = _gaq || [];
      _gaq.push(['_setAccount', 'UA-44561333-1']);
      _gaq.push(['_trackPageview']);

      (function() {
        var ga = document.createElement('script');
        ga.src = 'https://ssl.google-analytics.com/ga.js';
        var s = document.scripts[0];
        s.parentNode.insertBefore(ga, s);
      }());
    </script>
  </body>
</html>
